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Abstract 

A linear non-Gaussian structural equation model called LiNGAM is 
an identifiable model for exploratory causal analysis. Previous methods 
estimate a causal ordering of variables and their connection strengths 
based on a single dataset. However, in many application domains, data are 
obtained under different conditions, that is, multiple datasets are obtained 
rather than a single dataset. In this paper, we present a new method to 
jointly estimate multiple LiNGAMs under the assumption that the models 
share a causal ordering but may have different connection strengths and 
differently distributed variables. In simulations, the new method estimates 
the models more accurately than estimating them separately. 



1 Introduction 

In causal discovery methods for continuous variables, it is typical [I] to assume i) 
the data generating process of observed variables Xi (i = 1, • • ■ , p) is graphically 
represented by a directed acyclic graph, that is, DAG; ii) the relations are 
linear; iii) there is no unobserved confounding variable. Let us denote by bij 
the connection strength from Xj to Xi and by k{i) a causal order of Xi in the 
DAG so that no later variable determines or has a directed path on any earlier 
variable. Without loss of generality, each observed variable Xi is assumed to 
have zero mean. Then we have 

X i ^ y bij Xj ~\~ G i , ( 1 ) 

k(j)<k(i) 

where are external influences that are continuous variables with zero means 
and non-zero variances and are mutually independent. The independence as- 
sumption between means that there is no latent confounding variable. In 
matrix form, the model ([1]) is written as 

x = Bx + e, (2) 
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where the connection strength matrix or adjacency matrix B collects 6y and the 
vectors x and e collect Xi and e*, respectively. Note that the matrix B can be 
permuted to be lower triangular with all zeros on the diagonal if simultaneous 
equal row and column permutations are made according to a causal ordering 
k(i) [2]. The problem of causal discovery is now to estimate the connection 
strength matrix B based on data x only. Each fry represents the direct causal 
effect of Xj on Xi and each ay, the (z, j)-th element of the matrix A = (I — B) _1 . 
the total causal effect of Xj on Xi [3] . In LiNGAM [J] , external influences et are 
assumed to be non-Gaussian. This makes it possible to identify a causal ordering 
k(i) without using any background knowledge on the structure [I]. Once a causal 
ordering k(i) is identified, the connection strengths by can be estimated by 
ordinary least squares regression [4] . Many works have been conducted to extend 
LiNGAM to more general cases [M3 and apply LiNGAM and its extensions on 
real- world problems [8HTU] . 

Previous methods for LiNGAM [4lflTj estimate a causal ordering k(i) based 
on a single dataset. However, in many applications, data are often obtained 
under different conditions. In neuroinformatics, fMRI (functional magnetic res- 
onance imaging) signals are frequently recorded for multiple subjects [I"2l[T3] . In 
bioinformatics, gene expression levels are measured under various experimental 
conditions [TJ. That is, multiple datasets are obtained rather than a single 
dataset. Since such data are likely to have some common structure, it would be 
effective in estimation to exploit the similarities [TSJ- It is well known j!5l[TC] 
that it could cause serious bias in estimation to naively concatenate multiple 
datasets into a single dataset and analyze it. Many sophisticated methods for 
such situations have been proposed in other statistical models [T4HT7] . How- 
ever, to our best knowledge, no such method is yet proposed in the literature 
of LiNGAM. Thus, in this paper, we propose a method to jointly estimate 
such LiNGAMs that share a causal ordering but may have different connection 
strength matrices and differently distributed external influences. 

2 Model definition 

We now define our model extending LiNGAM [4], that is, the model ([2]) with 
non-Gaussian external influences, to multiple group cases. We assume the fol- 
lowing data generating processes: 

4 9) = E (,9 = v--,c), (3) 

k(j)<m) 

where g is the group index, c is the number of groups, k(i) is such a causal 
ordering of variables that they form a DAG in each group g, xf', ef ^ and 
of) are the observed variables, external influences, and connection strengths 

of Group <7, respectively. External influences ef^ are continuous non-Gaussian 
variables with zero means and non-zero variances and are mutually independent. 
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In matrix form, the model ([3]) is written as 

x (s) = B (9) x (g) +e (g) ( g = 1, . - . , c ). (4) 

The connection strength matrix B^ collects b\j and the vectors and e' 9 ' 

collect xf and ef \ respectively. The assumption that the models shares a 
causal ordering k(i) might be rather weak since a recent method |15j for conven- 
tional graphical models makes a much stronger assumption that the zero/non- 
zero patterns of connection strength matrices, that is, the DAG structures, are 
equivalent between the groups. 

3 A direct estimation method for multiple LiNGAMs 

First, we briefly review an estimation method for a single group in Subscction l3.ll 
and extend it to multiple group cases in Subsection 13.21 

3.1 Background: A direct method for a single group 

In [IT], a direct estimation method called DircctLiNGAM was proposed. Di- 
rectLiNGAM estimates causal orders one by one and eventually a causal order- 
ing of all the variables. An exogenous variables is a variable with no parents, 
and the corresponding row of B has all zeros. Therefore, an exogenous variable 
can be at the top of such a causal ordering that makes B lower triangular with 
zeros on the diagonal. Then we remove the effect of the exogenous variable from 
the other variables by regressing it out. We iterate this procedure until all the 
variables are ordered. Now the point is how we can find an exogenous variable. 
The following lemma of [UJ shows how it is possible: 

Lemma 1 Assume that the input data x strictly follows LiNGAM, that is, the 
model (0) with non-Gaussian external influences. This means that we assume 
that all the model assumptions are met and the sample size is infinite. Denote 
by rp the residual when Xi is regressed on xf. rf = Xi — c °^^ x ^ %j (i ^ ])■ 
Then a variable Xj is exogenous if and only if Xj is independent of its residuals 
r\ j) for alli^j. q 

To evaluate independence between a variable Xj and its residuals rj (i ^ j), 
we first evaluate pairwise independence between the variable and each of the 
residuals using a kernel-based estimator of mutual information called KGV [18] , 
which we denote by KGV(xi,r^), and subsequently compute the sum of the 
pairwise independence measures over the residuals. The non-negative estimator 
KGV asymptotically goes to zero if and only if the variables are independent 
[T8] . Thus we obtain the following statistic to evaluate independence between a 
variable Xj and its residuals rf . 

T kernel ( Xj ;U) = ]T KGV( Xj ,4 j) ), (5) 
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where U denotes the set of the subscripts of variables Xi, that is, U={1, • • • , p}. 

3.2 A direct algorithm for multiple groups 

In this subsection, we present a method to jointly estimate multiple LiNGAMs 
constraining their causal ordcrings to be identical. This would enable more 
accurate estimation of the LiNGAMs in Eq. (0} than estimating them separately 
since they share a causal ordering. 

Wc first extend Lemma [T] to multiple-group cases: 

Lemma 2 Assume that the input data x*- 9 -* (g — 1, • • • ,c) strictly follow the 
model (0). Denote by r$ A) the residual when xf is regressed on Xj. r = 

(S) COV(^ S) ,0;S 9) ) (g) ,. , . (g) . 

x\ (^) — J a var " ta °i' e x } ls exogenous for any g ij 

and only if the following holds for any g: x^ is independent of its residuals 

Proof (i) Assume that x^ is exogenous for any g. Applying Lemma[T]on the 
LiNGAM of each group g, x)- is independent of its residuals rf { As for all i =/= j. 
(ii) Assume that xf is not exogenous for some g. For the g, there exists such i 
that x^ is not independent of its residual because of Lemma [TJ From (i) 
and (ii), the lemma is proven. ■ 

Thus, to estimate a variable that is exogenous in every group, we propose to 
find the variable subscript of such a variable that minimizes the weighted sum 
of the individual independence measures in Eq. ([5]): 

c 

T(j-U) = Y, wi9)T ^mel(x ( j 9) ;U), (6) 
3=1 

where is the non-negative weight of Group g. Here, we simply take the 
sample size of Group g which we denote by as the weight , that is, 
w (g) — rj(ff). Then in each group we subtract the effect of the exogenous variable 
found from the other variables using least squares regression. Similarly to the 
single group case, we can find all the causal orders by iterating this. 

This is a deflation approach and enables estimation of sub-orderings instead 
of the whole ordering. It would be a more practical goal to estimate the first q 
(q < p and q < min(ri < - 1 ', ■ ■ ■ , n^ c ')) orders in a causal ordering rather than all 
the orders when the sample size is not very large compared to the number of 
variables [TUIIT^] since the number of causal orders to be estimated gets smaller. 
Moreover, it is necessary to stop the algorithm before the covariance matrices of 
residuals get singular when any of the sample sizes is smaller than the number 
of variables. 

We now present a direct algorithm to estimate a causal ordering and the 
connection strengths in the multiple LiNGAMs Q: 
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1. Given a set of p-dimensional random vectors (g = 1, ■ • • , c), a set of its 
variable subscripts U and p x data matrices of the random vectors as 
X( ff ) respectively, initialize an ordered list of variables K := and m := 1 
and choose the number of causal orders to be estimated q (q = 1, • • ■ , or p). 

2. Remove the mean from the data. 

3. Repeat until min(g,p — 1) subscripts are appended to K: 

(a) For each group g, perform least squares regressions of xf ^ on for all 
i G U\K (i =/= j) and compute the residual vectors rf), and the residual 

data matrix R$ from the data matrix for all j e U\K. Find the 
variable subscript m of such a variable that is most independent of its 
residuals: 

m = arg min^ T(j;U\K), (7) 

where T is the independence measure defined in Eq. ©. 

(b) Append m to the end of K . 

(c) Let XO) := R { , 9 \ and replace by r { , 9 \, that is, x<» := r^ 9) , 

v / (m) r J (m)' ' (m) 

(fl = l,--',c). 

4. Append the remaining variable to the end of K if q = p. 

5. For each group g, construct a q x q strictly lower triangular matrix 'B < f\ 
that is, a sub-matrix of R^ 9 ' by following the order in K, and estimate the 
connection strengths by using least squares regression on the original random 
vectors x^) and the original data matrix X.^ 9 K 

This essentially reduces to the original DirectLiNGAM [TT] if the number of 
groups is one, that is, c = 1. 



4 Simulations 

As a sanity check of our method, we performed two experiments with simulated 
data. The first experiment consisted of 100 trials. In each trial, we generated 
10 datasets with dimension p = 10 and sample sizes rS 1 ^ = ■ ■ ■ = TV- 5 ' = 50 
and n^ 6 ) = • • • = n^ 10 ) = 100 and applied our joint estimation method on the 
data. For comparison, we also tested two methods; 1) a separate estimation 
method that applies DirectLiNGAM [TT| on the datasets separately and 2) a 
naive method that creates a single dataset concatenating all the datasets and 
applies DirectLiNGAM on it. The sample sizes would not be very sufficient 
to reliably estimate the models separately [20]. Each dataset was created as 
follows: 
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1. We constructed the p x p connection strength matrix with all zeros and 
replaced every element in the lower-triangular part by independent real- 
izations of Bernoulli random variables with success probability s similarly 
to [H] ■ The probability s determines the sparseness of the model. We set 
the sparseness s so that the expected number of adjacent variables was 
half of the dimension. 

2. We replaced each non-zero entry in the connection strength matrix by a 
value randomly chosen from the interval [—1.5,-0.5] U [0.5, 1.5] and se- 
lected variances of the external influences from the interval [1,3]. The 
resulting matrix was used as the data-generating connection strength ma- 
trix B(f). 

3. We generated data with sample size by independently drawing the 
external influence variables ef from various 18 non-Gaussian distribu- 
tions used in [18] including super- and sub-Gaussian distributions and 
symmetric and asymmetric distributions |llj . 

4. The values of the observed variables xf' were generated using the connec- 
tion strength matrix B' 9 ' and external influences ef' . Then, we generated 
constants following the Gaussian distribution N(0, 4) and added them to 
the variables as their means. Finally, we permuted the variables according 
to a random ordering shared by all the groups. 

We computed the percentages of datasets where all the causal orders were 
correctly learned in the total datasets, that is, 1,000 datasets (= 10 datasets 
x 100 trials). The percentages for the joint estimation method, the separate 
estimation method and the naive method were 96.6%, 44.9% and 0.4%, respec- 
tively. Further, we computed the average squared errors between true connec- 
tion strengths and estimated ones. The average squared errors for the three 
methods were 0.02, 0.07 and 0.51, respectively. 

In the second experiment, we considered a situation with more variables 
than observations, that is, p > n^ 9 \ We generated 10 datasets with dimension 
p — 40 and sample size = • • • = = 10 and = ■ ■ ■ = n( 10 ) = 20 in 
the same manner as above and only estimated the first five variables in a causal 
ordering (q = 5). We repeated this procedure 100 times. The percentages of 
datasets for which the joint, separate and naive methods correctly estimated 
such first five variables were 92.1%, 43.1% and 44.4%, respectively. The average 
squared errors for the three methods were 0.09, 0.21 and 0.33, respectively. 

To sum up, our joint estimation method worked much better than the other 
two methods in the simulations as summarized in Fig. [TJ 

5 Conclusions 

We proposed a new joint estimation method for multiple LiNGAMs. The new 
method assumes that such LiNGAMs share a causal ordering, but the conncc- 
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p=10 and q=10 



p=40 and q=5 



96.6% 



100 



92.1% 




Joint Separate Naive 



Joint Separate Naive 



Figure 1: Percentages of datasets with successful causal order discovery for 
the three methods: our joint method, the separate method and the naive 
method. Left: p = 10 and nW = ••• = = 50 and =■■■ = = 100. 
All the causal orders were estimated (q = 10). Right: p = 40 and rS 1 * 1 = ■ ■ ■ = 
77,( 5 ) — 10 and = ■ ■ ■ = n^ 10 ' = 20. First five causal orders were estimated 
(q = 5). 



tion strengths of variables and the distributions of external influences may be 
different between the models. 



Acknowledgements 

We would like to thank Yoshinobu Kawahara, Takashi Washio, Satoshi Hara 
and Aapo Hyvarinen for interesting discussions. This work was supported by 
MEXT Grant-in- Aid for Young Scientists #21700302. 

References 

[1] P. Spirtcs, C. Glymour, and R. Schemes. Causation, Prediction, and 
Search. Springer Verlag, 1993. (2nd ed. MIT Press 2000). 

[2] K. Bollen. Structural Equations with Latent Variables. John Wiley & Sons, 
1989. 

[3] P. O. Hoyer, S. Shimizu, A. Kerminen, and M. Palviainen. Estimation of 
causal effects using linear non-gaussian causal models with hidden variables. 
International Journal of Approximate Reasoning, 49(2):362-378, 2008. 

[4] S. Shimizu, P. O. Hoyer, A. Hyvarinen, and A. Kerminen. A linear non- 
gaussian acyclic model for causal discovery. Journal of Machine Learning 
Research, 7:2003-2030, 2006. 



7 



[5] A. Hyvarinen, K. Zhang, S. Shimizu, and P. O. Hoyer. Estimation of a 
structural vector autoregressive model using non-Gaussianity. Journal of 
Machine Learning Research, 11:1709-1731, May 2010. 

[6] P. O. Hoyer, D. Janzing, J. Mooij, J. Peters, and B. Scholkopf. Nonlinear 
causal discovery with additive noise models. In D. Roller, D. Schuurmans, 
Y. Bengio, and L. Bottou, editors, Advances in Neural Information Pro- 
cessing Systems 21, pages 689-696. 2009. 

[7] G. Lacerda, P. Spirtes, J. Ramsey, and P. O. Hoyer. Discovering cyclic 
causal models by independent components analysis. In Proceedings of the 
24th Conference on Uncertainty in Artificial Intelligence (UAI2008), pages 
366-374, 2008. 

[8] L. Faes, S. Erla, E. Tranquillini, D. Orrico, and G. Nollo. An identifi- 
able model to assess frequency-domain Granger causality in the presence 
of significant instantaneous interactions. In Proceedings of the 32nd Annual 
International Conference of the IEEE Engineering in Medicine and Biology 
Society (EMBS2010), pages 1699-1702, 2010. 

[9] E. Ferkingsta, A. L0landa, and M. Wilhelmsen. Causal modeling and in- 
ference for electricity markets. Energy Economics, 33(3):404-412, 2011. 

[10] Y. Sogawa, S. Shimizu, A. Hyvarinen, T. Washio, T. Shimamura, and 
S. Imoto. Discovery of exogenous variables in data with more variables 
than observations. In Proceedings of the 20th International Conference on 
Artificial Neural Networks (ICANN2010), pages 67-76, 2006. 

[11] S. Shimizu, T. Inazumi, Y. Sogawa, A. Hyvarinen, Y. Rawahara, 
T. Washio, P. O. Hoyer, and R. Bollcn. DirectLiNGAM: A direct method 
for learning a linear non-Gaussian structural equation model. Journal of 
Machine Learning Research, 12:1225-1248, April 2011. Forthcoming. 

[12] F. Esposito, T. Scarabino, A. Hyvarinen, J. Himberg, E. Formisano, S. Co- 
mani, G. Tedeschi, R. Goebel, E. Seifritz, and F. Di Salle. Independent 
component analysis of fMRI group studies by self-organizing clustering. 
Neurolmage, 25(l):193-205, 2005. 

[13] S. M. Smith, R. L. Miller, G. Salimi-Rhorshidi, M. Webster, C. F. Beck- 
mann, T. E. Nichols, J. D. Ramsey, and M. W. Woolrich. Network mod- 
elling methods for FMRI. Neurolmage, 54(2):875-891, 2011. 

[14] T. Shimamura, S. Imoto, R. Yamaguchi, M. Nagasaki, and S. Miyano. In- 
ferring dynamic gene networks under varying conditions for transcriptomic 
network comparison. Bioinformatics, 26(8):1064-1072, 2010. 

[15] R. E. Tillman. Structure learning with independent non-identically dis- 
tributed data. In Proceedings of the 26th International Conference on Ma- 
chine Learning (ICML2009), Montreal, Canada, pages 1041-1048. ACM, 
2009. 



[16] S. Y. Lcc and K. L. Tsui. Covariance structure analysis in several popula- 
tions. Psychometrika, 47(3):297-308, 1982. 

[17] S. Lee, J. Zhu, and E. Xing. Adaptive multi-task Lasso: with application 
to eQTL detection. In J. Lafferty, C. K. I. Williams, J. Shawe- Taylor, R.S. 
Zemel, and A. Culotta, editors, Advances in Neural Information Processing 
Systems 23, pages 1306-1314. 2010. 

[18] F. R. Bach and M. I. Jordan. Kernel independent component analysis. 
Journal of Machine Learning Research, 3:1-48, 2002. 

[19] A. Hyvarinen. Pairwise measures of causal direction in linear non-Gaussian 
acyclic models. In JMLR Workshop and Conference Proceedings (Proc. 2nd 
Asian Conference on Machine Learning, ACML2010), volume 13, pages 1— 
16, 2010. 

[20] Y. Sogawa, S. Shimizu, Y. Kawahara, and T. Washio. An experimen- 
tal comparison of linear non-Gaussian causal discovery methods and their 
variants. In Proceedings of 2010 International Joint Conference on Neural 
Networks (IJCNN2010), pages 768-775, 2010. 

[21] M. Kalisch and P. Biihlmann. Estimating high-dimensional directed acyclic 
graphs with the PC-algorithm. Journal of Machine Learning Research, 
8:613-636, 2007. 



9 



